Mastering Machine Learning with Spark 2.x by Tellez Alex & Pumperla Max & Malohlava Michal

Author: Tellez, Alex & Pumperla, Max & Malohlava, Michal
Language: eng
Format: azw3
Publisher: Packt Publishing
Published: 2017-08-31T04:00:00+00:00


Because we have already covered the preceding steps in Chapter 4, Predicting Movie Reviews Using NLP and Spark Streaming, we'll quickly reproduce them in this section.

As usual, we begin by starting the Spark shell, which is our working environment:

export SPARKLING_WATER_VERSION="2.1.12"
export SPARK_PACKAGES=\
"ai.h2o:sparkling-water-core_2.11:${SPARKLING_WATER_VERSION},\
ai.h2o:sparkling-water-repl_2.11:${SPARKLING_WATER_VERSION},\
ai.h2o:sparkling-water-ml_2.11:${SPARKLING_WATER_VERSION},\
com.packtpub:mastering-ml-w-spark-utils:1.0.0"

$SPARK_HOME/bin/spark-shell \
  --master 'local[*]' \
  --driver-memory 8g \
  --executor-memory 8g \
  --conf spark.executor.extraJavaOptions=-XX:MaxPermSize=384M \
  --conf spark.driver.extraJavaOptions=-XX:MaxPermSize=384M \
  --packages "$SPARK_PACKAGES" "$@"

In the prepared environment, we can directly load the data:

val DATASET_DIR = s"${sys.env.get("DATADIR").getOrElse("data")}/aclImdb/train"
val FILE_SELECTOR = "*.txt"

case class Review(label: Int, reviewText: String)

val positiveReviews = spark.read.textFile(s"$DATASET_DIR/pos/$FILE_SELECTOR")
  .map(line => Review(1, line)).toDF

val negativeReviews = spark.read.textFile(s"$DATASET_DIR/neg/$FILE_SELECTOR")
  .map(line => Review(0, line)).toDF

var movieReviews = positiveReviews.union(negativeReviews)
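The union simply stacks the two labeled DataFrames. The labeling scheme itself can be checked with plain Scala collections, independent of Spark; the sample review strings below are invented for illustration:

```scala
// Mirror of the Review case class used above.
case class Review(label: Int, reviewText: String)

// Positive reviews get label 1, negative reviews label 0; concatenation
// plays the role of positiveReviews.union(negativeReviews).
val positive = Seq("a wonderful film", "loved every minute").map(Review(1, _))
val negative = Seq("dull and overlong", "terrible acting").map(Review(0, _))
val all = positive ++ negative
```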

We also define a tokenization function that splits the reviews into tokens and removes common stop words:

import org.apache.spark.ml.feature.StopWordsRemover

val stopWords = StopWordsRemover.loadDefaultStopWords("english") ++ Array("ax", "arent", "re")
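A stop-word list like this is typically paired with a simple tokenizer. A minimal sketch of such a function in plain Scala follows; the helper name `toTokens` and the minimum-token-length parameter are our assumptions for illustration, not the book's exact API:

```scala
// Hypothetical tokenizer: lowercase the review, split on non-word
// characters, then drop short tokens and anything in the stop-word list.
def toTokens(minTokenLength: Int,
             stopWords: Array[String],
             review: String): Array[String] =
  review.toLowerCase
    .split("\\W+")
    .filter(t => t.length >= minTokenLength && !stopWords.contains(t))
```

For example, `toTokens(3, stopWords, reviewText)` keeps only tokens of at least three characters that are not stop words.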



